Goto

Collaborating Authors

 in-context example



Towards Reliable and Holistic Visual In-Context Learning Prompt Selection

Neural Information Processing Systems

Visual In-Context Learning (VICL) has emerged as a prominent approach for adapting visual foundation models to novel tasks, by effectively exploiting contextual information embedded in in-context examples, which can be formulated as a global ranking problem of potential candidates. Current VICL methods, such as Partial2Global and VPR, are grounded in the similarity-priority assumption that images more visually similar to a query image serve as better in-context examples. This foundational assumption, while intuitive, lacks sufficient justification for its efficacy in selecting optimal in-context examples. Furthermore, Partial2Global constructs its global ranking from a series of randomly sampled pairwise preference predictions. Such a reliance on random sampling can lead to incomplete coverage and redundant samplings of comparisons, thus further adversely impacting the final global ranking. To address these issues, this paper introduces an enhanced variant of Partial2Global designed for reliable and holistic selection of in-context examples in VICL. Our proposed method, dubbed RH-Partial2Global, leverages a jackknife conformal prediction-guided strategy to construct reliable alternative sets and a covering design-based sampling approach to ensure comprehensive and uniform coverage of pairwise preferences. Extensive experiments demonstrate that RH-Partial2Global achieves excellent performance and outperforms Partial2Global across diverse visual tasks.


Counterfactual reasoning: an analysis of in-context emergence

Neural Information Processing Systems

Large-scale neural language models exhibit remarkable performance in in-context learning: the ability to learn and reason about the input context on the fly. This work studies in-context counterfactual reasoning in language models, that is, the ability to predict consequences of a hypothetical scenario. We focus on a well-defined, synthetic linear regression task that requires noise abduction. Accurate prediction is based on (1) inferring an unobserved latent concept and (2) copying contextual noise from factual observations. We show that language models are capable of counterfactual reasoning. Further, we enhance existing identifiability results and reduce counterfactual reasoning for a broad class of functions to a transformation on in-context observations.


SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning

Neural Information Processing Systems

Multimodal in-context learning (ICL) remains underexplored despite significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks.


Self-Generated In-Context Examples Improve LLMAgents for Sequential Decision-Making Tasks

Neural Information Processing Systems

Improving Large Language Model (LLM) agents for sequential decision-making tasks typically requires extensive task-specific knowledge engineering--custom prompts, curated examples, and specialized observation/action spaces. We investigate a different approach where agents automatically improve by learning from their own successful experiences without human intervention. Our method constructs and refines a database of self-generated trajectories that serve as in-context examples for future tasks.


Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations

Neural Information Processing Systems

Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize those strategies that govern their behavior. This suggests a limited degree of metacognition -- the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. Metacognition enhances LLMs' capabilities in solving complex tasks but also raises safety concerns, as models may obfuscate their internal processes to evade neural-activation-based oversight (e.g., safety detector). Given society's increased reliance on these models, it is critical that we understand their metacognitive abilities. To address this, we introduce a neuroscience-inspired neurofeedback paradigm that uses in-context learning to quantify metacognitive abilities of LLMs to report and control their activation patterns. We demonstrate that their abilities depend on several factors: the number of in-context examples provided, the semantic interpretability of the neural activation direction (to be reported/controlled), and the variance explained by that direction. These directions span a "metacognitive space" with dimensionality much lower than the model's neural space, suggesting LLMs can monitor only a small subset of their neural activations. Our paradigm provides empirical evidence to quantify metacognition in LLMs, with significant implications for AI safety (e.g., adversarial attack and defense).


CGBENCH: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research

Neural Information Processing Systems

Variant and gene interpretation are fundamental to personalized medicine and translational biomedicine. However, traditional approaches are manual and labor-intensive. Generative language models (LMs) can facilitate this process, accelerating the translation of fundamental research into clinically-actionable insights. While existing benchmarks have attempted to quantify the capabilities of LMs for interpreting scientific data, these studies focus on narrow tasks that do not translate to real-world research. To meet these challenges, we introduce CGBENCH, a robust benchmark that tests reasoning capabilities of LMs on scientific publications.


Supplementary for Paper2Poster: Benchmarking Multimodal Poster Automation from Scientific Papers

Neural Information Processing Systems

AAblation Study1 We conduct ablation studies to evaluate three key design choices in PosterAgent: (1) the binary-tree2 layout strategy for layout planning; (2) the inclusion of a commenter module as a visual critic; and3 (3) the use of in-context examples to enhance the visual perception capabilities of the commenter.4 We define the following variants:5 Direct: replacing the binary-tree layout with direct layout generation by an LLM;6 Tree: using the binary-tree layout strategy but removing the commenter module;7 Tree + Commenter: including the commenter module but without in-context examples;8 Tree + Commenter + IC: the full system, with both the commenter and in-context examples.9 All ablation variants are implemented using PosterAgent-4o, keeping all other components un-10 changed to isolate the effect of each factor. We visualize and compare results across five randomly11 selected papers from Paper2Poster, as shown in Figures 1 to 5.12 When prompting the LLM to directly generate poster layouts (Direct), the results are often structurally13 compromised (e.g., Figures 1a-3a), or resemble blog-style layouts that lack visual hierarchy and14 appeal (Figures 4a,5a). Fine-grained layout components, such as text boxes and figures, are especially15 challenging to synthesize in this setting: for instance, Figures1a-4a exhibit missing text boxes that16 leave noticeable blank areas, and Figure 4a fails to preserve the correct aspect ratio of figures.17


Transformers are almost optimal metalearners for linear classification

Neural Information Processing Systems

Transformers have demonstrated impressive in-context learning (ICL) capabilities, raising the question of whether they can serve as metalearners that adapt to new tasks using only a small number of in-context examples, without any further training. While recent theoretical work has studied transformers' ability to perform ICL, most of these analyses do not address the formal metalearning setting, where the objective is to solve a collection of related tasks more efficiently than would be possible by solving each task individually. In this paper, we provide the first theoretical analysis showing that a simplified transformer architecture trained via gradient descent can act as a near-optimal metalearner in a linear classification setting. We consider a natural family of tasks where each task corresponds to a class-conditional Gaussian mixture model, with the mean vectors lying in a shared $k$-dimensional subspace of $\mathbb{R}^d$. After training on a sufficient number of such tasks, we show that the transformer can generalize to a new task using only $\widetilde{O}(k / \widetilde{R}^4)$ in-context examples, where $\widetilde{R}$ denotes the signal strength at test time. This performance (almost) matches that of an optimal learner that knows exactly the shared subspace and significantly outperforms any learner that only has access to the in-context data, which requires $\Omega(d / \widetilde{R}^4)$ examples to generalize.


Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

Neural Information Processing Systems

Improving Large Language Model (LLM) agents for sequential decision-making tasks typically requires extensive task-specific knowledge engineering--custom prompts, curated examples, and specialized observation/action spaces. We investigate a different approach where agents automatically improve by learning from their own successful experiences without human intervention. Our method constructs and refines a database of self-generated trajectories that serve as in-context examples for future tasks.